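Anime Industry Data Analysis¶
This notebook takes an in-depth look at a large-scale anime dataset.
The analysis is provided as an additional feature of the anime recommendation website
developed for the Final Project of the NLP course in the Department of Computer Engineering at Chungbuk National University,
and it gives users trends and interesting facts about the anime industry.
Main Goals¶
The goal is to analyze the dataset and answer the following questions:
- Which anime are the most popular across genres and themes?
- Which genres and themes are currently trending in the anime industry?
- Which anime studios and producers have been successful?
- How have users rated anime, and what interesting conclusions can be drawn from those ratings?
Analysis Process¶
🔍 Through data visualization and detailed analysis, we will:¶
- Identify popular anime genres and themes.
- Study the direction in which the anime industry is developing.
- Uncover key trends that can make the recommendation system more interesting and useful for users.
Let's get started with the data exploration!
Importing Libraries¶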
In [13]:
!pip install plotly
Requirement already satisfied: plotly in /opt/anaconda3/lib/python3.12/site-packages (5.24.1) Requirement already satisfied: tenacity>=6.2.0 in /opt/anaconda3/lib/python3.12/site-packages (from plotly) (8.2.3) Requirement already satisfied: packaging in /opt/anaconda3/lib/python3.12/site-packages (from plotly) (24.1)
In [15]:
!pip install wordcloud
Requirement already satisfied: wordcloud in /opt/anaconda3/lib/python3.12/site-packages (1.9.4) Requirement already satisfied: numpy>=1.6.1 in /opt/anaconda3/lib/python3.12/site-packages (from wordcloud) (1.26.4) Requirement already satisfied: pillow in /opt/anaconda3/lib/python3.12/site-packages (from wordcloud) (10.4.0) Requirement already satisfied: matplotlib in /opt/anaconda3/lib/python3.12/site-packages (from wordcloud) (3.9.2) Requirement already satisfied: contourpy>=1.0.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib->wordcloud) (1.2.0) Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib->wordcloud) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib->wordcloud) (4.51.0) Requirement already satisfied: kiwisolver>=1.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib->wordcloud) (1.4.4) Requirement already satisfied: packaging>=20.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib->wordcloud) (24.1) Requirement already satisfied: pyparsing>=2.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib->wordcloud) (3.1.2) Requirement already satisfied: python-dateutil>=2.7 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib->wordcloud) (2.9.0.post0) Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
In [17]:
!pip install langdetect
Requirement already satisfied: langdetect in /opt/anaconda3/lib/python3.12/site-packages (1.0.9) Requirement already satisfied: six in /opt/anaconda3/lib/python3.12/site-packages (from langdetect) (1.16.0)
In [19]:
# Data handling
import numpy as np
import pandas as pd
# Visualization
import plotly.express as px
import plotly.graph_objects as go # for 3D plot visualization
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
from wordcloud import WordCloud
from langdetect import detect
from datetime import datetime
Reading our Dataset¶
In [21]:
# Setting column display to 50
pd.set_option('display.max_columns', 50)
In [24]:
# Importing anime details dataframe
df_anime=pd.read_csv('/Users/giyos/Downloads/archive (1)/anime-dataset-2023.csv')
print("Shape of the Dataset:",df_anime.shape)
df_anime.head(3)
Shape of the Dataset: (24905, 24)
Out[24]:
| | anime_id | Name | English name | Other name | Score | Genres | Synopsis | Type | Episodes | Aired | Premiered | Status | Producers | Licensors | Studios | Source | Duration | Rating | Rank | Popularity | Favorites | Scored By | Members | Image URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Cowboy Bebop | Cowboy Bebop | カウボーイビバップ | 8.75 | Action, Award Winning, Sci-Fi | Crime is timeless. By the year 2071, humanity ... | TV | 26.0 | Apr 3, 1998 to Apr 24, 1999 | spring 1998 | Finished Airing | Bandai Visual | Funimation, Bandai Entertainment | Sunrise | Original | 24 min per ep | R - 17+ (violence & profanity) | 41.0 | 43 | 78525 | 914193.0 | 1771505 | https://cdn.myanimelist.net/images/anime/4/196... |
| 1 | 5 | Cowboy Bebop: Tengoku no Tobira | Cowboy Bebop: The Movie | カウボーイビバップ 天国の扉 | 8.38 | Action, Sci-Fi | Another day, another bounty—such is the life o... | Movie | 1.0 | Sep 1, 2001 | UNKNOWN | Finished Airing | Sunrise, Bandai Visual | Sony Pictures Entertainment | Bones | Original | 1 hr 55 min | R - 17+ (violence & profanity) | 189.0 | 602 | 1448 | 206248.0 | 360978 | https://cdn.myanimelist.net/images/anime/1439/... |
| 2 | 6 | Trigun | Trigun | トライガン | 8.22 | Action, Adventure, Sci-Fi | Vash the Stampede is the man with a $$60,000,0... | TV | 26.0 | Apr 1, 1998 to Sep 30, 1998 | spring 1998 | Finished Airing | Victor Entertainment | Funimation, Geneon Entertainment USA | Madhouse | Manga | 24 min per ep | PG-13 - Teens 13 or older | 328.0 | 246 | 15035 | 356739.0 | 727252 | https://cdn.myanimelist.net/images/anime/7/203... |
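The preview above shows the string `UNKNOWN` standing in for missing values (for example in `Premiered`). As a minimal sketch, assuming the same sentinel is used throughout the file, `pd.read_csv` can be told to parse it as `NaN` at load time. The two-row CSV below is a hypothetical stand-in for the real file, not data taken from it.

```python
import io
import pandas as pd

# Tiny hypothetical stand-in for anime-dataset-2023.csv, using the same sentinel.
sample_csv = io.StringIO(
    "anime_id,Name,Score,Rank\n"
    "1,Cowboy Bebop,8.75,41.0\n"
    "2,Example Title,UNKNOWN,UNKNOWN\n"
)

# na_values tells pandas to read the 'UNKNOWN' sentinel as NaN,
# so numeric columns get a numeric dtype instead of object.
df_sample = pd.read_csv(sample_csv, na_values=["UNKNOWN"])

print(df_sample.dtypes)
print(df_sample.isna().sum())
```

If the real file were loaded this way, any cell that compares a column against the string `'UNKNOWN'` would need to use `isna()` / `notna()` instead.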
In [26]:
# Importing user details dataframe
df_user=pd.read_csv('/Users/giyos/Downloads/archive (1)/users-details-2023.csv')
print("Shape of the Dataset:",df_user.shape)
df_user.head()
Shape of the Dataset: (731290, 16)
Out[26]:
| | Mal ID | Username | Gender | Birthday | Location | Joined | Days Watched | Mean Score | Watching | Completed | On Hold | Dropped | Plan to Watch | Total Entries | Rewatched | Episodes Watched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Xinil | Male | 1985-03-04T00:00:00+00:00 | California | 2004-11-05T00:00:00+00:00 | 142.3 | 7.37 | 1.0 | 233.0 | 8.0 | 93.0 | 64.0 | 399.0 | 60.0 | 8458.0 |
| 1 | 3 | Aokaado | Male | NaN | Oslo, Norway | 2004-11-11T00:00:00+00:00 | 68.6 | 7.34 | 23.0 | 137.0 | 99.0 | 44.0 | 40.0 | 343.0 | 15.0 | 4072.0 |
| 2 | 4 | Crystal | Female | NaN | Melbourne, Australia | 2004-11-13T00:00:00+00:00 | 212.8 | 6.68 | 16.0 | 636.0 | 303.0 | 0.0 | 45.0 | 1000.0 | 10.0 | 12781.0 |
| 3 | 9 | Arcane | NaN | NaN | NaN | 2004-12-05T00:00:00+00:00 | 30.0 | 7.71 | 5.0 | 54.0 | 4.0 | 3.0 | 0.0 | 66.0 | 0.0 | 1817.0 |
| 4 | 18 | Mad | NaN | NaN | NaN | 2005-01-03T00:00:00+00:00 | 52.0 | 6.27 | 1.0 | 114.0 | 10.0 | 5.0 | 23.0 | 153.0 | 42.0 | 3038.0 |
In [28]:
# Importing user score dataframe
df_score=pd.read_csv('/Users/giyos/Downloads/archive (1)/users-score-2023.csv')
print("Shape of the dataset:",df_score.shape)
df_score.head()
Shape of the dataset: (24325191, 5)
Out[28]:
| | user_id | Username | anime_id | Anime Title | rating |
|---|---|---|---|---|---|
| 0 | 1 | Xinil | 21 | One Piece | 9 |
| 1 | 1 | Xinil | 48 | .hack//Sign | 7 |
| 2 | 1 | Xinil | 320 | A Kite | 5 |
| 3 | 1 | Xinil | 49 | Aa! Megami-sama! | 8 |
| 4 | 1 | Xinil | 304 | Aa! Megami-sama! Movie | 8 |
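Data Analysis¶
Data Exploration¶
Inspecting Each DataFrame¶
To understand the data better, it is important to inspect each DataFrame individually. This includes assessing its structure and identifying missing values. We start with the info() method, which gives a comprehensive overview of a DataFrame's columns and structure.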
In [32]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import numpy as np
from PIL import Image
In [33]:
df_anime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   anime_id      24905 non-null  int64
 1   Name          24905 non-null  object
 2   English name  24905 non-null  object
 3   Other name    24905 non-null  object
 4   Score         24905 non-null  object
 5   Genres        24905 non-null  object
 6   Synopsis      24905 non-null  object
 7   Type          24905 non-null  object
 8   Episodes      24905 non-null  object
 9   Aired         24905 non-null  object
 10  Premiered     24905 non-null  object
 11  Status        24905 non-null  object
 12  Producers     24905 non-null  object
 13  Licensors     24905 non-null  object
 14  Studios       24905 non-null  object
 15  Source        24905 non-null  object
 16  Duration      24905 non-null  object
 17  Rating        24905 non-null  object
 18  Rank          24905 non-null  object
 19  Popularity    24905 non-null  int64
 20  Favorites     24905 non-null  int64
 21  Scored By     24905 non-null  object
 22  Members       24905 non-null  int64
 23  Image URL     24905 non-null  object
dtypes: int64(4), object(20)
memory usage: 4.6+ MB
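`info()` reports every column of `df_anime` as fully non-null, but it only counts real `NaN` values. The cells that follow show that `Score` and `Rank` contain the placeholder string `UNKNOWN`. A minimal sketch of counting that placeholder per column is given below. `count_sentinel` and the small `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def count_sentinel(df, sentinel="UNKNOWN"):
    """Return, per column, how many cells equal the sentinel string."""
    return (df == sentinel).sum().sort_values(ascending=False)

# Hypothetical miniature of df_anime with the same kind of placeholder.
demo = pd.DataFrame({
    "Score": ["8.75", "UNKNOWN", "UNKNOWN"],
    "Genres": ["Action", "UNKNOWN", "Comedy"],
    "Members": [10, 20, 30],
})

unknown_counts = count_sentinel(demo)
print(unknown_counts)
```

Calling `count_sentinel(df_anime)` would give a per-column picture of hidden missing values before any column is converted or plotted.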
In [34]:
# Preprocessing Score column
df_anime['Score'].value_counts()
Out[34]:
Score
UNKNOWN 9213
6.31 80
6.54 80
6.25 79
6.51 79
...
3.21 1
3.29 1
1.85 1
3.69 1
4.07 1
Name: count, Length: 567, dtype: int64
In [39]:
scores = df_anime['Score'][df_anime['Score'] != 'UNKNOWN']
scores = scores.astype('float')
score_mean= round(scores.mean() , 2)
In [41]:
df_anime['Score'] = df_anime['Score'].replace('UNKNOWN', score_mean)
df_anime['Score'] = df_anime['Score'].astype('float64')
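The cell above fills 9,213 of the 24,905 scores with a single mean value, which makes those rows indistinguishable from anime that were really rated. As a sketch of one alternative, under the assumption that a boolean flag is acceptable, `pd.to_numeric(..., errors="coerce")` can convert the column and a mask can record which rows were originally unknown. The four-element series is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the raw 'Score' column.
scores_raw = pd.Series(["8.75", "UNKNOWN", "6.31", "7.00"], name="Score")

# errors="coerce" turns anything non-numeric (here 'UNKNOWN') into NaN.
scores_num = pd.to_numeric(scores_raw, errors="coerce")

# Remember which rows were missing before filling them in.
score_missing = scores_num.isna()

# Same fill rule as the notebook: the rounded mean of the known scores.
score_mean = round(scores_num.mean(), 2)
scores_filled = scores_num.fillna(score_mean)

print(pd.DataFrame({"Score": scores_filled, "Score_missing": score_missing}))
```

Plots of the score distribution could then filter on the flag, so the imputed rows do not show up as an artificial spike at the mean.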
In [43]:
# Preprocessing Rank column
df_anime['Rank'].value_counts()
Out[43]:
Rank
UNKNOWN 4612
0.0 187
6542.0 4
16675.0 4
6577.0 4
...
18424.0 1
18423.0 1
11642.0 1
8977.0 1
14536.0 1
Name: count, Length: 15198, dtype: int64
In [45]:
df_anime['Rank'] = df_anime['Rank'].replace('UNKNOWN', np.nan)
df_anime['Rank'] = df_anime['Rank'].astype('float64')
In [47]:
df_user.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731290 entries, 0 to 731289
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   Mal ID            731290 non-null  int64
 1   Username          731289 non-null  object
 2   Gender            224383 non-null  object
 3   Birthday          168068 non-null  object
 4   Location          152805 non-null  object
 5   Joined            731290 non-null  object
 6   Days Watched      731282 non-null  float64
 7   Mean Score        731282 non-null  float64
 8   Watching          731282 non-null  float64
 9   Completed         731282 non-null  float64
 10  On Hold           731282 non-null  float64
 11  Dropped           731282 non-null  float64
 12  Plan to Watch     731282 non-null  float64
 13  Total Entries     731282 non-null  float64
 14  Rewatched         731282 non-null  float64
 15  Episodes Watched  731282 non-null  float64
dtypes: float64(10), int64(1), object(5)
memory usage: 89.3+ MB
In [49]:
df_score.isnull().sum()
Out[49]:
user_id          0
Username       232
anime_id         0
Anime Title      0
rating           0
dtype: int64
Data Visualization¶
For Anime Dataset¶
In [53]:
# Count the number of anime titles by type
type_counts = df_anime['Type'].value_counts()
# Create a bar chart
fig = px.bar(type_counts, x=type_counts.index, y=type_counts.values, color=type_counts.index, labels={'x':'Anime Type', 'y':'Count'},
title='Count of Anime Titles by Type')
fig.show()
In [55]:
# Filter out anime titles with popularity value 0
df_valid_popularity = df_anime[df_anime['Popularity'] > 0]
# Sort the dataframe by popularity rank and select the top 15
top_15_popular = df_valid_popularity.sort_values(by='Popularity', ascending=True).head(15)
# Create a bar chart with different colors for each bar
fig = px.bar(top_15_popular, x='Name', y='Popularity',
             labels={'Name': 'Anime Title', 'Popularity': 'Popularity Rank'},
             title='Top 15 Most Popular Animes',
             color='Name')
# Note: 'Popularity' is a rank, so a lower number means a more popular anime.
fig.show()
In [57]:
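# Create a scatter plot of score against member count
# ('Members' is the number of list members, not the 'Scored By' count)
fig = px.scatter(df_anime, x='Score', y='Members',
                 labels={'Score':'Overall Score', 'Members':'Number of Members'},
                 title='Anime Score vs. Number of Members')
fig.show()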
In [59]:
# Sort the dataframe by the number of members of each anime
top_15_scored = df_anime.sort_values(by='Members', ascending=False).head(15)
# Create a bar chart
fig = px.bar(top_15_scored, x='Name', y='Members', labels={'Members':'Number of Users', 'Name':'Anime Title'},color='Name',
title='Top 15 Animes by Number of Users')
fig.show()
In [61]:
# Split the genres and count their occurrences
genre_counts = df_anime[df_anime['Genres'] != "UNKNOWN"]['Genres'].apply(lambda x: x.split(', ')).explode().value_counts()
# Create a bar chart
fig = px.bar(genre_counts, x=genre_counts.index, y=genre_counts.values,
labels={'x':'Genre', 'y':'Count'},
title='Count of Anime Titles by Genre',
color=genre_counts.index)
fig.show()
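The chart above counts titles per genre, which shows how common a genre is but not how well it is rated. As a hedged sketch toward the first goal of the notebook, the same split-and-explode step can feed a per-genre summary with both the title count and the mean score. The three-row `demo` frame and the `genre_stats` name are hypothetical. On the real data, the mean would also include any scores that were imputed earlier.

```python
import pandas as pd

# Hypothetical miniature of df_anime.
demo = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Genres": ["Action, Sci-Fi", "Action, Comedy", "UNKNOWN"],
    "Score": [8.0, 6.0, 7.0],
})

known = demo[demo["Genres"] != "UNKNOWN"]

# One row per (anime, genre) pair.
per_genre = known.assign(Genre=known["Genres"].str.split(", ")).explode("Genre")

# Title count and mean score for each genre, most common genres first.
genre_stats = (
    per_genre.groupby("Genre")["Score"]
    .agg(titles="count", mean_score="mean")
    .sort_values("titles", ascending=False)
)
print(genre_stats)
```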
In [63]:
# Select the 20 most frequent genres
top_20_genres = genre_counts.head(20)
# Create a bar chart with custom style
fig = px.bar(top_20_genres, x=top_20_genres.index, y=top_20_genres.values,
             labels={'x':'Genre', 'y':'Count'},
             title='Top 20 Most Common Genres In The Anime Industry')
# Customize the bar chart appearance
fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.8)
# 'titlefont' is deprecated in Plotly, so the nested title=dict(font=...) form is used
fig.update_layout(xaxis_tickangle=-45, xaxis=dict(tickfont=dict(size=12)),
                  yaxis=dict(title=dict(font=dict(size=14))))
fig.show()
In [65]:
# Create the plotly figure
fig = go.Figure(data=[go.Pie(labels=top_20_genres.index, values=top_20_genres.values,
hole=0.6, hoverinfo='label+percent', textinfo='value')])
fig.update_layout(title='Distribution of Anime Genres',
legend=dict(font=dict(size=12), title='Genre'),
annotations=[dict(text='Genre', x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.show()
In [67]:
# Concatenate all genre values into a single string
genre_text = ' '.join(df_anime[df_anime['Genres'] != "UNKNOWN"]['Genres'].dropna())
# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(genre_text)
# Convert the WordCloud object to an image
wordcloud_image = wordcloud.to_image()
# Create a Plotly figure to display the WordCloud image
fig = go.Figure(go.Image(z=wordcloud_image))
fig.update_layout(title='Word Cloud - Genre')
fig.show()
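Joining the genre strings with spaces lets the word cloud tokenize the text itself, so a multi-word genre such as `Award Winning` may be split into separate words. A minimal sketch of an alternative is to count whole genre names first. `genre_frequencies` is a hypothetical helper, and the three input strings are illustrative.

```python
from collections import Counter

def genre_frequencies(genre_strings, sentinel="UNKNOWN"):
    """Count each comma-separated genre, keeping multi-word names whole."""
    counts = Counter()
    for entry in genre_strings:
        if entry == sentinel:
            continue  # skip rows with no genre information
        counts.update(g.strip() for g in entry.split(","))
    return dict(counts)

freqs = genre_frequencies([
    "Action, Award Winning, Sci-Fi",
    "Action, Sci-Fi",
    "UNKNOWN",
])
print(freqs)
```

The resulting dictionary can be passed to `WordCloud(...).generate_from_frequencies(freqs)` in place of `generate(genre_text)`, so each genre is drawn as one unit sized by its count.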
In [69]:
# Create a violin plot for anime popularity by type
fig = px.violin(df_anime, x='Type', y='Popularity',
labels={'Type':'Anime Type', 'Popularity':'Popularity'},
title='Distribution of Anime Popularity by Type',
color='Type')
fig.show()
In [71]:
# Create a box plot for anime scores by type
fig = px.box(df_anime, x='Type', y='Score',
labels={'Type':'Anime Type', 'Score':'Score'},
title='Distribution of Anime Scores by Type',
color='Type')
fig.show()
In [73]:
# Create a bubble chart to visualize the relationship between popularity, members, and score
fig = px.scatter(df_anime, x='Popularity', y='Members', size='Score', color='Type',
                 labels={'Popularity':'Popularity', 'Members':'Number of Members'},
                 title='Relationship between Popularity, Number of Members, and Score')
fig.show()
In [75]:
# Create a 3D scatter plot to visualize the relationship between popularity, members, and score
fig = go.Figure(data=go.Scatter3d(
    x=df_anime['Popularity'],
    y=df_anime['Members'],
    z=df_anime['Score'],
    mode='markers',
    marker=dict(
        size=5,
        color=df_anime['Rank'],
        colorscale='Viridis',
        opacity=0.8
    ),
    text=df_anime['Name'],
    hovertemplate='<b>Title</b>: %{text}<br><b>Popularity</b>: %{x}<br><b>Members</b>: %{y}<br><b>Score</b>: %{z}',
))
fig.update_layout(scene=dict(
    xaxis_title='Popularity',
    yaxis_title='Members',
    zaxis_title='Score'
), title='Relationship between Popularity, Members, and Score')
fig.show()
In [77]:
# Create a correlation matrix
correlation_matrix = df_anime[['Score', 'Popularity', 'Rank']].corr()
# Create a heatmap of the correlation matrix
fig = ff.create_annotated_heatmap(z=correlation_matrix.values,
x=list(correlation_matrix.columns),
y=list(correlation_matrix.index),
colorscale='Viridis')
fig.update_layout(title='Correlation Matrix')
fig.show()
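`Popularity` and `Rank` are ordinal positions rather than measured quantities, so a rank-based coefficient may describe their relationship to `Score` more naturally than the default Pearson correlation. The sketch below compares the two on a small hypothetical frame. `DataFrame.corr(method="spearman")` is standard pandas, and the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical miniature with the same three columns as the heatmap.
demo = pd.DataFrame({
    "Score": [8.75, 8.38, 8.22, 6.50, 5.10],
    "Popularity": [43, 602, 246, 3000, 9000],
    "Rank": [41.0, 189.0, 328.0, 5000.0, None],
})

# Pearson measures linear association; Spearman measures monotonic association.
pearson = demo.corr(method="pearson")
spearman = demo.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Rows with a missing `Rank` are dropped pairwise, so each coefficient is computed only from the rows where both columns are present.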
In [79]:
df_anime['Licensors'].value_counts()
Out[79]:
Licensors
UNKNOWN 20170
Funimation 957
Sentai Filmworks 818
Discotek Media 275
Aniplex of America 222
...
Bandai Entertainment, Maiden Japan 1
ADV Films, SoftCel Pictures 1
VIZ Media, Media Blasters, Sentai Filmworks, Geneon Entertainment USA 1
Bandai Entertainment, Discotek Media, NYAV Post, Bandai Visual USA 1
Bandai Namco Online 1
Name: count, Length: 265, dtype: int64
In [162]:
# Split the comma-separated 'Licensors' strings into individual licensors
licensors_list = [licensor.strip()
                  for licensors in df_anime[df_anime['Licensors'] != "UNKNOWN"]['Licensors'].str.split(',')
                  for licensor in licensors]
# Count the occurrences of each licensor
licensor_counts = pd.Series(licensors_list).value_counts()
# Filter the licensor_counts series to exclude 'Unknown'
filtered_licensor_counts = licensor_counts[licensor_counts.index != 'Unknown']
# Select the top 10 licensors
top_10_licensors = filtered_licensor_counts.head(10)
# Create the bar plot using Plotly
fig = px.bar(top_10_licensors, x=top_10_licensors.index, y=top_10_licensors.values, color=top_10_licensors.index)
# Customize the plot
fig.update_layout(
    title='Top 10 Anime Licensors',
    xaxis_title='Licensors',
    yaxis_title='Count',
    xaxis_tickangle=-45
)
# Show the plot
fig.show()
In [154]:
df_anime['Premiered'].value_counts()
Out[154]:
Premiered
UNKNOWN 19399
spring 2017 88
fall 2016 83
spring 2018 81
spring 2016 78
...
summer 1993 1
summer 1974 1
summer 1991 1
spring 1961 1
summer 2025 1
Name: count, Length: 244, dtype: int64
In [160]:
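# NOTE: season_counts is not defined in any cell shown in this notebook.
# As an assumption, it is rebuilt here from the season word of 'Premiered'
# (e.g. 'spring 2017' -> 'spring'), skipping 'UNKNOWN' entries.
premiered_known = df_anime.loc[df_anime['Premiered'] != 'UNKNOWN', 'Premiered']
season_counts = premiered_known.str.split().str[0].value_counts()
# Create the pie plot
fig = go.Figure(data=go.Pie(
    labels=season_counts.index,
    values=season_counts.values,
    hole=0.4,  # Add a donut hole in the center
    hoverinfo='label+percent',  # Display label and percentage on hover
    textinfo='value',  # Display count value as text inside each slice
    textfont=dict(size=14),  # Set the text font size
    marker=dict(
        colors=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'],  # Custom color palette
        line=dict(color='#ffffff', width=2)  # Set the color and width of the slice borders
    )
))
# Set the title and font style for the plot
fig.update_layout(
    title='Distribution of Premiered Seasons',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555')
)
fig.show()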
In [89]:
# NOTE: premiered_Year is not defined in any cell shown in this notebook.
# As an assumption, it is rebuilt here from the year part of 'Premiered'
# (e.g. 'spring 2017' -> 2017); 'UNKNOWN' becomes NaN.
premiered_Year = pd.to_numeric(df_anime['Premiered'].str.split().str[-1], errors='coerce')
# Filter out missing values from premiered_Year
filtered_premiered_year = premiered_Year.dropna().astype(int)
# Count the occurrences of each year
year_counts = filtered_premiered_year.value_counts()
# Sort the years in ascending order
sorted_years = sorted(year_counts.index)
# Create the bar plot
fig = go.Figure(data=go.Bar(
    x=sorted_years,
    y=year_counts.loc[sorted_years],
    marker=dict(color='#1f77b4'),  # Set the color of the bars
))
# Set the title and axis labels
fig.update_layout(
    title='Number of Animes Premiered by Year',
    xaxis_title='Year',
    yaxis_title='Number of Animes',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555')
)
fig.show()
In [91]:
# Count the occurrences of each studio
studio_counts = df_anime['Studios'].value_counts()
# Filter the studio_counts series to exclude 'UNKNOWN'
studio_counts = studio_counts[studio_counts.index != 'UNKNOWN']
# Select the top 10 studios with the highest number of animes
top_studios = studio_counts.head(10)
# Create the bar plot
fig = go.Figure(data=go.Bar(
x=top_studios.index,
y=top_studios.values,
marker=dict(color=top_studios.values, colorscale='Blues'), # Set the color of the bars using a colorscale
text=top_studios.values, # Set the text to be displayed on hover
hovertemplate='Studio: %{x}<br>Number of Animes: %{y}<extra></extra>', # Customize the hover template
))
# Set the title and axis labels
fig.update_layout(
title='Number of Animes by Studio (Top 10)',
xaxis_title='Studios',
yaxis_title='Number of Animes',
title_font=dict(size=20),
font=dict(size=12, color='#555555'),
plot_bgcolor='rgba(0, 0, 0, 0)' # Set the background color to transparent
)
fig.show()
In [93]:
# Count the occurrences of each source
source_counts = df_anime['Source'].value_counts()
# Filter the source_counts series to exclude 'UNKNOWN'
source_counts = source_counts[source_counts.index != 'UNKNOWN']
# Create the horizontal bar chart
fig = go.Figure(data=go.Bar(
x=source_counts.values,
y=source_counts.index,
orientation='h', # Set the orientation to horizontal
marker=dict(color=source_counts.values, colorscale='Viridis'), # Set the color of the bars using a colorscale
text=source_counts.values, # Set the text to be displayed on hover
hovertemplate='Source: %{y}<br>Number of Animes: %{x}<extra></extra>', # Customize the hover template
))
# Set the title and axis labels
fig.update_layout(
title='Number of Animes by Source',
xaxis_title='Number of Animes',
yaxis_title='Source',
title_font=dict(size=20),
font=dict(size=12, color='#555555')
)
fig.show()
In [95]:
# Sort the DataFrame by the 'Favorites' column in descending order
sorted_df = df_anime.sort_values('Favorites', ascending=False)
# Select the top 10 most favorited anime
top_favorites = sorted_df.head(10)
# Create the horizontal bar chart
fig = go.Figure(data=go.Bar(
x=top_favorites['Favorites'],
y=top_favorites['Name'],
orientation='h', # Set the orientation to horizontal
marker=dict(color='#1f77b4'), # Set the color of the bars
text=top_favorites['Favorites'], # Set the text to be displayed on hover
hovertemplate='Anime: %{y}<br>Favorites: %{x}<extra></extra>', # Customize the hover template
))
# Set the title and axis labels
fig.update_layout(
title='Top 10 Most Favorited Anime',
xaxis_title='Number of Favorites',
yaxis_title='Anime',
title_font=dict(size=20),
font=dict(size=12, color='#555555')
)
fig.show()
In [97]:
# Create a treemap of the same top 10 most favorited anime
fig = go.Figure(go.Treemap(
labels=top_favorites['Name'],
parents=[""] * len(top_favorites),
values=top_favorites['Favorites'],
hovertemplate='Name: %{label}<br>Favorites: %{value}',
))
# Set the color scale
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
'#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
fig.update_traces(marker=dict(colors=colors))
# Set the title
fig.update_layout(
title='Top 10 Most Favorited Anime (Treemap)',
title_font=dict(size=20),
font=dict(size=12, color='#555555'),
)
fig.show()
In [99]:
# Count the occurrences of each rating
rating_counts = df_anime[df_anime['Rating']!="UNKNOWN"]['Rating'].value_counts()
# Filter the rating_counts series to exclude 'Unknown'
rating_counts = rating_counts[rating_counts.index != 'Unknown']
# Create the pie plot
fig = go.Figure(data=go.Pie(
labels=rating_counts.index,
values=rating_counts.values,
hoverinfo='label+percent',
textinfo='value',
textfont=dict(size=12),
marker=dict(colors=['#1f77b4']), # Set the same color for all segments
hole=0.6, # Set the size of the inner hole to create a donut shape
))
# Set the title
fig.update_layout(
title='Distribution of Anime Ratings',
title_font=dict(size=20),
font=dict(size=12, color='#555555'),
)
fig.show()
In [132]:
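from langdetect.lang_detect_exception import LangDetectException

# NOTE: detect_language and map_language_code are not defined in any cell shown
# in this notebook. The two helpers below are assumed stand-ins, not the original code.
def detect_language(text):
    """Return the langdetect language code for text, or None if detection fails."""
    try:
        return detect(text)
    except LangDetectException:
        return None

# Hypothetical, partial mapping; codes not listed here are left unchanged
LANGUAGE_NAMES = {'ja': 'Japanese', 'ko': 'Korean', 'en': 'English',
                  'zh-cn': 'Chinese (Simplified)', 'zh-tw': 'Chinese (Traditional)'}

def map_language_code(code):
    """Map a language code to a readable name, falling back to the code itself."""
    return LANGUAGE_NAMES.get(code, code)

# Apply language detection to the 'Other name' column
Detected_Language = df_anime[df_anime['Other name']!="UNKNOWN"]['Other name'].apply(detect_language)
# Drop rows where language detection failed (i.e., where Detected_Language is None)
Detected_Language = Detected_Language.dropna()
# Count the occurrences of each language
language_counts = Detected_Language.value_counts()
# Map abbreviated language codes to full names for plotting
language_counts.index = language_counts.index.map(map_language_code)
fig = go.Figure(data=go.Bar(
    x=language_counts.values,
    y=language_counts.index,
    orientation='h',
    marker=dict(color=language_counts.values, colorscale='Viridis'),
    text=language_counts.values,  # Set the text shown on each bar
    hovertemplate='Native Language: %{y}<br>Number of Animes: %{x}<extra></extra>',
))
# Set the title and axis labels
fig.update_layout(
    title='Count of Animes by Language of Native Name',
    xaxis_title='Number of Animes',
    yaxis_title='Native Language',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555')
)
fig.show()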
For User Dataset¶
In [105]:
# Distribution of gender
# Count the occurrences of each gender
gender_counts = df_user['Gender'].value_counts(dropna=True)
# Define custom colors for the pie slices
colors = ['rgb(0, 123, 255)', 'rgb(255, 65, 54)', 'rgb(255, 187, 0)', 'rgb(125, 125, 125)']
# Create the pie plot
fig = go.Figure()
fig.add_trace(go.Pie(
labels=gender_counts.index,
values=gender_counts.values,
hole=0.3,
marker=dict(colors=colors, line=dict(color='#FFFFFF', width=2)),
hoverinfo='label+percent',
hovertemplate='<b>%{label}</b><br>%{percent}',
textinfo='value',
textposition='inside',
sort=False
))
# Customize the layout
fig.update_layout(
title='Gender Distribution',
title_x=0.5,
uniformtext_minsize=12,
uniformtext_mode='hide',
showlegend=False,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)',
margin=dict(l=20, r=20, t=100, b=20),
)
# Show the plot
fig.show()
In [106]:
df_user['Birthday'].value_counts(dropna=True)
Out[106]:
Birthday
1990-01-01T00:00:00+00:00 177
1989-03-26T00:00:00+00:00 169
1980-01-01T00:00:00+00:00 166
1930-01-01T00:00:00+00:00 153
1991-01-01T00:00:00+00:00 115
...
1966-12-06T00:00:00+00:00 1
2001-11-08T00:00:00+00:00 1
1954-10-16T00:00:00+00:00 1
1958-03-13T00:00:00+00:00 1
2000-10-13T00:00:00+00:00 1
Name: count, Length: 11247, dtype: int64
In [107]:
from datetime import datetime, timezone
# Age distribution
# Convert a birthday string to an approximate age in years
def calculate_age(birth_date):
    try:
        birth_year = int(str(birth_date).split('-')[0])  # Extract the birth year
    except ValueError:
        return None
    today_year = datetime.now(timezone.utc).year  # Use timezone-aware UTC
    age = today_year - birth_year
    # Keep only plausible ages (modify the range as needed)
    return age if 10 <= age < 60 else None
# Apply age calculation to the 'Birthday' column and drop invalid results
Age = df_user['Birthday'].dropna().apply(calculate_age).dropna()
# Create the histogram
fig = px.histogram(Age, nbins=20, title='Age Distribution', labels={'value': 'Age', 'count': 'Count'})
# Customize the layout
fig.update_layout(
xaxis=dict(title='Age'),
yaxis=dict(title='Count'),
bargap=0.1,
showlegend=False,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)',
margin=dict(l=50, r=20, t=100, b=50),
)
# Show the plot
fig.show()
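The histogram above uses the birth year only, so an age can be off by one depending on whether the birthday has already passed in the current year. A minimal sketch of a date-aware version follows. `age_on` is a hypothetical helper, and the fixed `reference_day` is an assumption chosen so that the result is reproducible.

```python
from datetime import date, datetime

def age_on(birthday_iso, today):
    """Return full years between an ISO-8601 birthday string and `today`."""
    try:
        born = datetime.fromisoformat(birthday_iso).date()
    except (TypeError, ValueError):
        return None  # missing or malformed birthday
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)

reference_day = date(2023, 7, 1)  # assumed snapshot date, for illustration only
print(age_on("1985-03-04T00:00:00+00:00", reference_day))
print(age_on("1985-12-25T00:00:00+00:00", reference_day))
print(age_on(None, reference_day))
```

The `Birthday` counts shown earlier also have many entries on January 1 (for example 1990-01-01 and 1930-01-01), which may be default values rather than real birthdays, so any age estimate from this column should be read with caution.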
In [108]:
# Location analysis
# Count the occurrences of each location
location_counts = df_user['Location'].value_counts()
# Create a bar chart
fig = px.bar(location_counts.head(20),
x=location_counts.head(20).index,
y=location_counts.head(20).values,
labels={'x': 'Location', 'y': 'Count'},
title='Top 20 User Locations',
color=location_counts.head(20).index)
# Customize the layout
fig.update_layout(
xaxis=dict(title='Location'),
yaxis=dict(title='Count'),
bargap=0.1,
showlegend=False,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)',
margin=dict(l=50, r=20, t=100, b=50),
)
# Show the plot
fig.show()
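`Location` is free text, so the same place can appear under several spellings and granularities (`California`, `Oslo, Norway`). As a rough sketch, assuming the text after the last comma is the broadest region, locations can be coarsened before counting. `last_location_part` is a hypothetical helper and is only a heuristic: a bare `California` stays as it is.

```python
def last_location_part(location):
    """Return the text after the last comma, or the whole string if none."""
    if not isinstance(location, str) or not location.strip():
        return None  # missing (NaN) or empty location
    return location.split(",")[-1].strip()

examples = ["Oslo, Norway", "Melbourne, Australia", "California", float("nan")]
print([last_location_part(loc) for loc in examples])
```

Applied as `df_user['Location'].map(last_location_part).value_counts()`, this would group city-level entries under their trailing region name.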
In [111]:
# Create a correlation matrix
correlation_matrix = df_user[['Days Watched', 'Mean Score', 'Total Entries', 'Rewatched', 'Episodes Watched']].corr()
# Create a heatmap of the correlation matrix
fig = ff.create_annotated_heatmap(z=correlation_matrix.values,
x=list(correlation_matrix.columns),
y=list(correlation_matrix.index),
colorscale='Viridis')
fig.update_layout(title='Correlation Matrix')
fig.show()
For User Score Dataset¶
In [113]:
# Anime watched (rated) by the most users in the df_score dataset
# Count the distinct users who rated each anime title
anime_watch_count = df_score.groupby('Anime Title')['user_id'].nunique().reset_index()
anime_watch_count = anime_watch_count.rename(columns={'user_id': 'User Count'})
# Sort the dataframe in descending order by the number of users
anime_watch_count = anime_watch_count.sort_values(by='User Count', ascending=False)
# Select the top 10 anime titles with the highest number of users
top_n = 10
top_anime_watch_count = anime_watch_count.head(top_n)
# Define a colorful color palette
color_palette = px.colors.qualitative.Plotly
# Create the bar chart with one discrete color per title
# (a numeric color column would use a continuous scale and ignore the palette)
fig = px.bar(top_anime_watch_count, x='User Count', y='Anime Title', orientation='h',
             title=f'Top {top_n} Anime Titles Watched by Most Users',
             labels={'User Count': 'Number of Users', 'Anime Title': 'Anime Title'},
             color='Anime Title',
             color_discrete_sequence=color_palette)
# Customize the layout
fig.update_layout(showlegend=False, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)',
margin=dict(l=50, r=20, t=100, b=50))
# Show the plot
fig.show()
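Grouping by `Anime Title` merges any two entries that happen to share a title. A hedged sketch of a safer variant groups by `anime_id` and carries the title along as a label. The `demo_scores` frame is hypothetical, and its row with `anime_id` 99 is an invented second entry sharing the title, included only to show the difference.

```python
import pandas as pd

# Hypothetical miniature of df_score; anime_id 99 is an invented duplicate title.
demo_scores = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2],
    "anime_id": [21, 21, 21, 48, 99],
    "Anime Title": ["One Piece", "One Piece", "One Piece", ".hack//Sign", "One Piece"],
    "rating": [9, 8, 10, 7, 6],
})

# Count distinct users per anime_id and keep one title per id for display.
rated_by = (
    demo_scores.groupby("anime_id")
    .agg(title=("Anime Title", "first"), users=("user_id", "nunique"))
    .sort_values("users", ascending=False)
)
print(rated_by)
```

Here the two `One Piece` entries stay separate, whereas a group-by on the title alone would report them as a single anime.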